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Abstract 

Background: Cancer progression is associated with genomic instability and an accumulation of gains and losses of 
DNA. The growing variety of tools for measuring genomic copy numbers, including various types of array-CGH, SNP 
arrays and high-throughput sequencing, calls for a coherent framework offering unified and consistent handling of 
single- and multi-track segmentation problems. In addition, there is a demand for highly computationally efficient 
segmentation algorithms, due to the emergence of very high density scans of copy number. 

Results: A comprehensive Bioconductor package for copy number analysis is presented. The package offers a unified 
framework for single sample, multi-sample and multi-track segmentation and is based on statistically sound penalized 
least squares principles. Conditional on the number of breakpoints, the estimates are optimal in the least squares 
sense. A novel and computationally highly efficient algorithm is proposed that utilizes vector-based operations in R. 
Three case studies are presented. 

Conclusions: The R package copynumber is a software suite for segmentation of single- and multi-track copy 
number data using algorithms based on coherent least squares principles. 

Keywords: Copy number, aCGH, Segmentation, Allele-specific segmentation, Penalized regression, Least squares, 
Bioconductor 



Background 

In cancer, the path from normal to malignant cell involves 
multiple genomic alterations including losses and gains 
of genomic DNA. A long series of studies have demon- 
strated the biological and clinical relevance of studying 
such genomic alterations (see, e.g., [1,2] and references 
therein). Genome-wide scans of copy number alterations 
may be obtained with array-based comparative genomic 
hybridization (aCGH), SNP arrays and high- throughput 
sequencing (HTS). After proper normalization and trans- 
formation of the raw signal intensities obtained from such 
technologies, the next step is usually to perform segmen- 
tation to identify regions of constant copy number. Many 
segmentation algorithms are designed to analyse samples 
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individually (see, e.g., [3-16] and references therein), while 
most studies involve multiple samples, multiple tracks, 
or both. Joint handling of multiple samples is compu- 
tationally and conceptually challenging, see e.g. [17,18]. 
Most systematic approaches for this problem are based on 
individual segmentation of each sample followed by post- 
processing to combine results across samples (see, e.g., 
[18] and references therein), while some recent publi- 
cations propose strategies for joint segmentation of all 
samples [19-23]. Recently, the emergence of new tech- 
nologies have pushed the limit of genomic resolution, 
opening new vistas for studying very short aberrations, 
including aberrations affecting only part of a gene or gene 
regulatory sites in the DNA. A major challenge raised by 
these novel technologies is the steadily growing length of 
the data tracks, which drastically increases the demand for 
computationally efficient algorithms. The occurrence of 
extreme observations (outliers) of biological or technical 
origin pose an additional challenge, as most segmentation 
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methods are substantially affected by such observations. 
Picard et al. [6] propose a least squares based segmenta- 
tion method that results in a piecewise constant fit to the 
copy number data. Their approach assumes that the user 
either supplies the desired number of segments or leaves 
to the method to automatically determine this number. 
In this paper, we describe a related approach. In particu- 
lar, the proposed method utilizes penalized least squares 
regression to determine a piecewise constant fit to the 
data. Introducing a fixed penalty y > 0 for any differ- 
ence in the fitted values of two neighboring observations 
induces an optimal solution of particular relevance to 
copy number data: a piecewise constant curve fully deter- 
mined by the breakpoints and the average copy number 
values on each segment. The user denned penalty y essen- 
tially controls the level of empirical evidence required to 
introduce a breakpoint. Given the number of breakpoints, 
the solution will be optimal in terms of least squares error. 

To achieve high processing efficiency, dynamic pro- 
gramming is used (see [24]). To further increase compu- 
tational efficiency, a novel vector based algorithm is pro- 
posed, and even further speed optimization is obtained 
through heuristics. A central aim of the present work has 
been to provide methodology and high-performance algo- 
rithms for solving single- and multiple-track problems 
within a statistically and computationally unified frame- 
work. All proposed algorithms are embedded in a com- 
prehensive software suite for copy number segmentation 
and visualization, available as the Bioconductor package 
copynumber. Main features of the package include: 

• Independent as well as joint segmentation of multiple 
samples 

• Segmentation of allele-specific SNP array data 



• Preprocessing tools for outlier detection and 
handling, and missing value imputation. 

• Visualization tools 

Implementation 

Systems overview 

The copynumber package provides functionality for 
many of the tasks typically encountered in copy number 
analysis: data preprocessing tools, segmentation meth- 
ods for various analysis scenarios, and visualization tools. 
Figure 1 shows an overview of the typical work flow. Input 
is normalized and log2 -transformed copy number mea- 
surements from one or more aCGH, SNP-array or HTS 
experiments. Allele-frequencies may also be specified for 
the segmentation of SNP-array data. It is strongly rec- 
ommended to detect and appropriately modify extreme 
observations (outliers) prior to segmentation, as these can 
have a substantial negative effect on the analysis. For this 
purpose, a specially designed Winsorization method is 
included in the software package. A missing-value impu- 
tation method appropriate for copy number data is also 
available. 

Segmentation methods for three different scenarios 
(single sample, multi-sample and allele-specinc segmenta- 
tion) are implemented in the package. All these methods 
are referred to as Piecewise Constant Fitting (PCF) algo- 
rithms and seek to minimize a penalized least squares 
criterion. In single sample PCF, individual segmentation 
curves are fitted to each sample. In multi-sample PCF, 
segmentation curves with common segment borders are 
simultaneously fitted to all samples. In allele-specific PCF, 
the segmentation curves are fitted to bivariate SNP-array 
data, providing identical segment borders for both data 
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Whole-genome plots 
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Figure 1 An overview of the copynumber package. Depending on the aim of the analysis, the input will be copy number data and possibly 
allele frequencies from one or more experiments. Preprocessing tools are available for outlier handling and missing data imputation, and three 
different methods handle single sample, multi-sample and allele-specific segmentation. Several options are also available for the graphical 
visualization of data and segmentation results. 
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tracks. A set of graphical tools are also available in the 
package to visualize data and segmentation results, and to 
plot aberration frequencies and heatmaps. Also included 
are diagnostics to explore different trade-offs between 
goodness-of-fit and parsimony in terms of the number 
of segments. In the remaining part of this section, a for- 
mal description of the algorithms is given. However, note 
that these details are not a prerequisite for reading later 
sections or for using the copy number package. 

Preprocessing: Outlier handling 

A challenging factor in copy number analysis is the fre- 
quent occurrence of outliers - single probe values that 
differ markedly from their neighbors. Such extreme 
observations can be due to the presence of very short 
segments of DNA with deviant copy numbers, to tech- 
nical aberrations, or a combination. When identification 
of CNVs is a purpose of the study, the multi-sample 
method described below may be applied for such detec- 
tion. However, when the focus is on detection of broader 
aberrations, the potentially harmful effect of extreme 
observations on aberration detection methods induces 
a need for outlier handling procedures (see, e.g., [3,6]). 
Since the copynumber package is based on least squares, 
an extreme observation will tend to cause the detection of 
a short segment. When searching for broader segments, 
such short (and abundant) segments will represent noise 
and may also affect the identification of other segments. 
We therefore now describe a procedure for reducing the 
effect of extreme observations, while the effects of this 
method will be considered in the Results and discus- 
sion section. Winsorization is a simple transformation 
reducing the influence of outliers by moving observations 
outside a certain fractile in the distribution to that fractile 
(see [25]). For identically distributed observations 
yi,...,y p , the corresponding Winsorized observations 
are denned as yj = ^{yj) where 



*(y) = *(y|0) = 



-0, y<-6 
0, y>0 
y, otherwise. 



Here, 0 > 0 determines how extreme an observation 
must be to be relocated, as well as the replacement value. 
A common choice is 0 = rs, where typically r e[ 1.5,3] 
and s is a robust estimate of the standard deviation (SD). A 
robust scale estimator is the Median Absolute Deviation 
(MAD), denned as the median of the values | yj — m\, where 
rh is the median of y\, . . . ,y p . For normally distributed 
observations, sm = 1.4826 • MAD corresponds to SD. 

Winsorization of copy number data may be achieved by 
first estimating the trend in the data and then Winsoriz- 
ing the residuals. Let the observations representing copy 
numbers in p genomic loci be y = (yi, . . . ,y p ), ordered 



according to genomic position. A simple estimator of the 
trend is the median filter. The trend estimate rhj in the 
;th locus is then given by the median of yj-k> • • ->yj+k 
for some k > 0, e.g. k = 25. The SD of the residuals 
yj — rhj may then be estimated with the MAD estimator 
sm> and Winsorized observations y™, . . . , y™ obtained by 
yj = rhj + ty(yj — rhj \ tsm)< Often, such simple and fast 
Winsorization is sufficient. However, copynumber also 
includes an iterative procedure with improved trend esti- 
mation based on the segmentation procedures described 
below (see Additional file 1). 

Single sample segmentation 

Consider first the basic problem of obtaining individual 
segmentations for each of a number of samples. Suppose 
attention is restricted to one chromosome arm on one 
sample. For each of the p loci, the obtained measurement 
can be conceived of as a sum of two contributions: 



y } = Zj + €j 



(1) 



where Zj is an unknown parameter reflecting the actual 
amount of sample DNA at the j'th locus and €j represents 
measurement noise. A breakpoint is said to occur between 
probe j and j + 1 if Zj ^ Zj+\. The sequence zi,...,z p 
thus implies a segmentation S = . . . Jm) of the chro- 
mosome arm, where I\ consists of the probes before the 
first breakpoint, h consists of the subsequent probes until 
the second breakpoint, and so on. To fit model (1), we 
minimize the penalized least squares criterion 



J2(yj-zj) 2 + v\s\ 



(2) 



7=1 



with respect to the sequence zi,...,z p . Here, \S\ denotes 
the number of segments in S, and y > 0 is a constant that 
controls the trade-off between seeking a good fit to the 
data (the first term) and restraining the number of level 
shifts (the second term). The minimizer z\, . . . , z p of (2) is 
fully determined by the segmentation S, since the best fit Zj 
on a given segment / is the average yj of the observations 
on that segment. Substituting the latter into (2) we obtain 
the equivalent criterion: 



£(S|y>x) = EE^-^/) 2 + y -\s\ 

IeS jel 



(3) 



IeS jel 



E£>) 2 /"/ + x-lsi (4) 

IeS jel 

where nj denotes the number of probes in segment /. Note 
that the first term in (4) does not depend on the seg- 
mentation S, hence minimization of (3) is equivalent to 
minimizing 

L'(S | y, Y ) = ~ E ( &' )2 / w ' + y ■ I 5 '" ^ 
IeS jel 
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Naive optimization of the cost function (5) with respect 
to the segmentation S requires examination of every pos- 
sible division of the probes on a chromosome arm into 
segments. For large p, this is not practically feasible. How- 
ever, a much more efficient implementation based on 
dynamic programming and requiring only 0(p 2 ) oper- 
ations is available. Dynamic programming is a method 
for solving complex problems by breaking them down 
into simpler subproblems, and specifically for problems 
where global decisions can be decomposed into a series 
of nested smaller decision problems. The crucial observa- 
tion that allows the use of dynamic programming to solve 
the present segmentation problem is that the optimal 
segmentations on each side of a breakpoint are mutu- 
ally independent. This can be used to iteratively build up 
a solution to the global segmentation problem. Suppose 
we know the optimal segmentations from the first probe 
up until the (k — l)st probe. Assume furthermore that 
the optimal segmentation for the k first probes contains 
breakpoints. Then the optimal segmentations from the 
last of these breakpoints and downwards has already been 
computed. Thus, by solving the above subproblems iter- 
atively for increasing /c, each step can utilize the results 
from the previous steps (see [24]). More formally, assume 
that the optimal segmentation of 1 . . . r and the corre- 
sponding total error e r are known for all probes r < /<. To 
extend the solution to r = /<, first note that there must be 
a last segment starting at some index j < k. From (5) we 
find that the cost term associated with that segment is: 



Then the total error for the optimal solution up until 
index k is found by minimizing the cost over the possible 
start positions j of the last segment. This cost consists of 
three terms: the cost of the last segment (dj), the optimal 
cost of the segmentation up until that point (ej-i) and the 
penalty for the break point (y): 

e k = min (df + e 7 _i + y) 

where eo = 0. The main work load of the above com- 
putation is to determine dfj for all 1 < j < k < p. In 
interpreted languages (such as R) where loop execution 
is often quite inefficient, a considerable improvement of 
performance may be obtained by utilizing native-language 
vector operations. Let aj = Ylr=jyr> a £ = (^p • • • > <zjjp 
and d* = (4>...>4)- Then 

we may calculate all required 
coefficients through a simple recursion: 

&k =[a*_i 0] +y k 
d k = -a k *a k /(k : 1) 



where (k : 1) = (k,k — 1, . . . , 1) and operators are vector- 
based. Hence, addition of a vector and a scalar adds the 
latter to each component of the former, and multiplica- 
tions and divisions are performed component-wise on 
the operands, e.g., a k *a k =[ (a^) 2 , . . . , (a k k ) 2 ]. Algorithm 1 
summarizes the computations. 

Algorithm 1 : Single sample PCF 

Input: Log- transformed copy numbers y\, . . . , y p ; penalty 

y > 0. 

Output: Segment start indices s\,...,sm and segment 
averages y\, . . . ,yM- 

1. Calculate scores by letting ao =[ ] and eo = 0, and 
iterate for I< = 1 . . .p: 

• a /c =[a*_i 0] +y k 

• d k = -a/<*a/</(/< : 1) 

• =[ ejt-i mm(d k + e*_i + y)} 

storing also the index t k e {1, 2, ...,/:} at which the 
minimum in the last step is achieved. 

2. Find segment start indices (right to left) 

si = t p ,S2 = t Sl -\ . . . ,sm = 1> where M > 1. 

3. Find segment averages y m = aveO/ 5w , . . . ,y 5(m _ 1} _i) 
for m = 1, . . . ,M, where so = p + 1. 

Throughout the paper we will tacitly assume that the 
penalty for the ith sample is yi = ycrf, where of is the esti- 
mated sample specific residual variance. In this way, we 
avoid scale dependency, and obtain consistent results for 
samples with equal signal-to-noise ratios. Such rescaling 
is also done by default in copynumber. Note that replac- 
ing the data j/j for the ith. sample with j/j/cr; for; = 1, . . . ,p, 
and rescaling after estimation, has the same effect. In 
copynumber, the algorithm has also been extended to 
allow a constraint on the least number of probes in a 
segment. 

Multi-sample segmentation 

Detection of very short or very low amplitude segments 
requires a small penalty y, with low specificity as a poten- 
tial result. However, when such segments are common to 
several samples, joint segmentation of multiple samples 
is an additional mechanism to increase sensitivity. This is 
a main motivation for introducing multi-sample segmen- 
tation methods that impose common breakpoints across 
all samples. Such methods are potentially useful for dis- 
covery of copy number variations (CNVs) and in those 
instances where the origin of the samples implies that 
segment boundaries are partly shared. Multi-sample seg- 
mentation with high penalty on breakpoints may also be 
used to obtain low-dimensional descriptions of the data, 
which may form the basis for defining variables to be 
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used in statistical procedures relating aberration patterns 
to clinical outcome. In the following, we describe a direct 
generalization of single sample PCF to handle multiple 
samples simultaneously, obtaining common breakpoints 
for all the samples with minimal residual sum of squares 
for a given number of breakpoints. Suppose copy number 
measurements y; = (y\, . . . ,y l p ) for samples i = 1, 2, ... ,n 
are obtained at the same loci in each sample. By direct 
generalization of the criterion (3), we seek in multi-sample 
PCF the minimizer of 

n 

L(S\ yi) ... )yni y) = Y,L(S\yuy) (6) 

i=l 

where L(S | y, y) is defined as in (3) and S is a given seg- 
mentation common to all samples. 



Algorithm 2: Multi-sample PCF 

Input: Log-transformed copy numbers for n samples 
yi, . . . , y p e R n ; penalty y > 0. 

Output: Common segment start indices si,...,sm and 
segment averages yi, . . . , jm £ R n < 

1. Calculate scores by letting An =[ ] and en = 0, and 
iterate for k = 1 . . .p: 

• A/,=[A^_i 0]+y/< 

• d k = -l T (A k *A k )/(k:l) 

• =[ e^_i min(d£ + e /c _i + ny)\ 

storing also the index t k e {1, 2, . . . , k] at which the 
minimum in the last step is achieved. 

2. Find segment start indices (right to left) 

si = t p ,S2 = t Sl -\ . . . ,sm = 1> where M > 1. 

3. Find segment averages y m = ave(y 5w , . . . , y 5(m _ 1} _i) 
for m = 1, . . . ,M, where so = p + 1. 

The multi-sample PCF algorithm (see Algorithm 2) is 
in principle quite similar to single sample PCF. However, 
when updating the solution from k — 1 to /<, the sums 
and sums of squares for the segments must be accumu- 
lated and stored separately for each sample. This can still 
be done iteratively, implying that the computational effort 
will be approximately equal to carrying out single sam- 
ple PCF on the same set of samples. Since the noise level 
may vary between samples, normalisation of the samples 
prior to segmentation and corresponding rescaling after 
estimation is advisable. It may also be desirable to scale 
the samples, e.g. to adjust for different tumor percentages. 
Thus, prior to running multi-sample PCF, we may replace 
y* by w{y l /di for i = 1, . . . , n, where w/ are weights and cr; 
is an estimate of the SD. In copy number normalization is 
performed by default for multi-sample PCF while further 
weighting is left as an option for the user. 



Allele-speciflc segmentation 

The PCF algorithm is easily adapted to variants of the 
basic segmentation problem discussed above. Here, we 
consider an adaptation to handle SNP genotype data. We 
then have for each SNP locus a measurement of (total) 
copy number (logR) as well as the B allele frequency 
(BAF). We may also have measurements of copy number 
only for a number of additional loci. The B allele fre- 
quency is a number between 0 and 1 indicating the allelic 
imbalance of a SNP. For a homozygous locus we have BAF 
close to 0 or 1, while for a heterozygous locus with an 
equal number of the two alleles A and B, BAF will be 
close to 0.5. An imbalance between the number of As and 
Bs results in a BAF value deviating from 0.5. A change 
in the total number of copies of a segment will alter the 
logR value, hence result in a level shift in the logR track. 
Unless the copy number change is balanced with respect 
to the two alleles, the BAF value will also change. In cases 
involving multiple copy number events at the same locus, 
the change may manifest itself only in one of the two 
tracks. For example, a loss of one copy of A followed by 
a gain of one copy of B would lead to unchanged logR 
and changed BAF. The purpose of the allele-specific PCF 
algorithm is to detect breakpoints for all such events. It 
fits piecewise constant curves simultaneously to the logR 
and the BAF data, forcing breakpoints to occur at the 
same positions in both. We emphasize that the purpose of 
the allele-specific PCF algorithm is segmentation only and 
not to make allele-specific copy number calls. However, 
such calls can be made on the basis of the segmentation 
described below, and this is done e.g. in the ASCAT algo- 
rithm (Allele-Specific Copy number Analysis of Tumors) 
which estimates allele-specific copy numbers as well as 
the percentage of cells with aberrant DNA and the tumor 
ploidy [26]. Suppose the data are given by (rj,bj) for j = 
l,...,p, where rj denotes the logR value and bj the BAF 
value at the ;th locus. For copy number probes, only rj is 
given and bj will be missing (henceforth coded as NA). For 
germline homozygous probes, the BAF values are nonin- 
formative and should be omitted from the analysis. If the 
germline genotype is known (e.g. from a matching blood 
sample), the user should replace the corresponding BAF 
values by NA. If the genotype is not known, the algorithm 
will apply a proxy to handle this issue (see below). Prior to 
segmentation, the allele-specific PCF algorithm performs 
the following steps: 

• The BAF data are mirrored around 0.5 by replacing bj 
with 1 - bj if bj > 0.5. 

• BAF values bj < 0 are replaced by NA. By default 
6 = 0.1. If germline homozygous probes have 
previously been replaced by NA's, let 0=0. 

• Let bi,...,b m denote the nonmissing B allele 
frequencies. Corresponding copy number values 
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ri,...,r m are found by pairing each logR probe with 
the nearest B-allele probe (ignoring those with 
missing values) and then averaging logR values paired 
to the same B-allele probe. Finally, let 
yi = (hi, ...,b m ) and y 2 = (h> . . . , r m ). 

The remaining part of the allele-specific PCF algorithm 
is then essentially an adaptation of the multi-sample PCF 
algorithm applied to two samples. It finds a common 
segmentation S for the two tracks by minimizing the 
penalized criterion 

L(S | yi,y 2 , y) = L(S | yi, y) + L(S | y 2 , y) (7) 

where L(S | y) is defined as in (3). 

Fast implementations of PCF 

The PCF algorithms may be generalized to allow break- 
points only at certain prespecified positions. Combined 
with simple heuristics, this may be used to further 
enhance the computational speed of PCF. For brevity we 
describe only the single sample segmentation case here; 
however the copy number package contains fast imple- 
mentations of both single- and multi-sample PCF. Com- 
putationally inexpensive methods can be used to identify a 
set of potential breakpoints among which the breakpoints 
of the solution to (3) are highly likely to be found. Sup- 
pose we restrict our attention to such a set of potential 
breakpoints. All relevant information for solving the opti- 
mization problem in (3) may then be condensed into three 
arrays containing the number of observations between 
two potential breakpoints, the corresponding sum of the 
observations and the sum of squares. Based on these 
quantities, PCF may be used with straightforward mod- 
ifications. Since the algorithm is of order 0(q 2 ), where 
q is the number of potential breakpoints, the potential 
increase in speed is substantial. Algorithm 3 outlines the 
procedure, while possible heuristics for finding poten- 
tial breakpoints are discussed below. One way to identify 
potential breakpoints is to use high-pass filters, i.e. a fil- 
ter obtaining high absolute values when passing over a 
breakpoint. The simplest such filter uses for each posi- 
tion i the difference Ylj=i+i J) ~ Ylj=i-k+i J) f° r some k To 
reduce artifacts due to the abrupt edges of such a filter, 
the copy number implementation assigns half weight to 
the outer 1/3 of the observations on each side. Fast imple- 
mentations of such filters in R may be obtained using the 
cum sum function. We currently use two filters with k=3 
and 12, respectively; additionally the single sample PCF 
implementation includes a filter searching for aberrations 
of length equal to the lowest accepted one. These fil- 
ters together identify about 15% of the probe positions as 
potential breakpoints. An additional way to speed up the 
computations on long sequences is to initially divide the 



sequence into overlapping subsequences, and iteratively 
find the solution. 

Having found the solution for the m first subsequences, 
we use high-pass filters to detect potential breakpoints for 
subsequence m + 1, and then use the fast PCF algorithm 
with the latter potential breakpoints as well as those found 
by PCF on earlier subsequences. The intention behind 
this iterative approach is to reduce potential boundary 
effects. Due to the quadratic order of the algorithm, this 
division into subsequences implies a substantial efficiency 
gain. In copynumber, subsequences are used when the 
chromosomal arm length exceeds 15000 probes, with sub- 
sequences of length 5000 and overlap 1000. 

Algorithm 3: Fast PCF 

Input: Log- transformed copy numbers yi, . . . ,y p ; penalty 
y > 0. 

Output: Segment start indices s\,...,sm and segment 
averages y\, . . .,yM- 

1. Apply heuristics to find potential breakpoints 
ro, n, . . . , r q , where r 0 = 1 and r q = p + 1. 

2. Form aggregates by letting u k = Y^jLrl 1 Ji> wnere 
k = l,...,q. 

3. Calculate scores by letting ao =[ ], Co =[ ], eo = 0, 
and iterate for k = l,...,q: 

• a k =[a*_i 0] +u k 

• c k =[ck-i 0] +r k - r k -i 

• d k = -a k *a k /c k 

• e* =[ ejt-i min(d /c + e^_i + y)] 

storing also the index t k e {1, 2, . . . , k} at which the 
minimum in the last step is achieved. 

4. Find segment start indices (right to left) 

si = r tq ,s 2 = r t$1 _ x ... ,5m = 1, where M > 1. 

5. Find segment averages y m = ave(y Sm , . . . , y S(w _i)-i) 
for m = 1, . . . ,M, where so = P + 1. 

Results and discussion 

Selection of penalty 

The selection of parameters determining the trade-off 
between high sensitivity (i.e. few missed true aberrations) 
and high specificity (i.e. few false aberrations) is important 
in all segmentation procedures. In PCF, this is controlled 
by the single penalty parameter y. A number of general 
model selection criteria exist, such as Cross-Validation, 
the Akaike Information Criterion (AIC) and the related 
Schwarzs Bayesian Information Criterion (BIC). How- 
ever, model selection for copy number segmentation is 
complicated by several factors. First, the distribution of 
the data at hand may vary substantially. An important 
example is the presence of local trends mimicking smaller 
aberrations; such low-amplitude "waves" in the data may 
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e.g. be due to variations in GC-content (see, e.g., [9]). Sec- 
ond, the purpose of the analysis may favor either higher 
sensitivity or higher specificity. For example, in clinical 
studies aimed at finding prognostic markers, the main 
focus may be on the most pronounced and commonly 
occurring deviations, while detecting more sporadic aber- 
rations may simply increase the noise level. In our expe- 
rience, the above model selection criteria tend to give too 
small penalty estimates and thus undersmooth the data. 
This is consistent with previous investigations showing 
that AIC and BIC are not appropriate for the breakpoint 
problem (for details and discussions of other alterna- 
tives, see [6,27]). Simulation studies of specificity may 
suggest a lower bound on the penalty y. For this pur- 
pose, sequences of independent and normally distributed 
observations without underlying aberrations were gener- 
ated, and PCF was applied with different choices of y. At 
y = 12 the number of falsely called aberrations is about 
0.5 per 10.000 probes, at y = 10 roughly 2 per 10.000 
probes, at y = 8 roughly 10 per 10.000 probes, and for 
y < 6 the number of falsely called aberrations is sub- 
stantial. This suggests y & 8 — 12 as a lower bound. 
Since the number of false aberrations per chromosome 
increases with increasing probe density, low values are 
most relevant for arrays with low probe density. In the 
presence of local trends, the number of false calls tends to 
inflate and the penalty should thus be increased above the 
lower bound. A fairly conservative penalty of y =40 is 
the default in the copy number package. This provides a 
starting point for exploration of the best penalty value for 
the specific problem at hand, however a systematic inspec- 
tion of results obtained for different penalties is advisable. 
Figure 2 illustrates the effect of changing y. Notice that 
the main features in the data are captured across the whole 
range of y -values, while finer details are only evident for 
smaller values. 

Aberration calling 

Aberration calling is used for detection of recurring alter- 
ations and in many other analyses. Introducing a parame- 
ter 0 > 0 that determines the sensitivity of the aberration 
calling (and hence what to consider as biologically signif- 
icant aberrations), we call probes for which Zj < —0 as 
losses and probes for which Zj > 0 as gains. Optionally, 
different thresholds 0+ and 0- may be used for gains and 
losses. To examine how well PCF aberration calling man- 
ages to distinguish between normal and aberrant regions, 
performance was compared with a very accurate mea- 
surement method. Specifically, aberration calls obtained 
with PCF on the basis of 1.8M SNP array data on 40 
samples were compared with calls obtained with MLPA 
(Multiplex Ligation-dependent Probe Amplification; see 
Additional file 2 for details). Since MLPA is limited to a 
small set of genomic positions, only 88 loci were used for 



the comparison. In all samples combined, MLPA identi- 
fied 546 aberrant and 2542 normal loci (the remaining 
432 loci were ambiguous or unclassified and left out of 
the analysis). Using the MLPA-classification as the gold 
standard, the sensitivity and specificity of PCF aberration 
calling were calculated for a range of threshold values 0. 
Figure 3 shows the resulting ROC curves, and panel (a) 
illustrates how the results for PCF depend on the choice of 
y. Importantly, aberration calling appears to be only mod- 
erately dependent on the choice of parameter values over 
a fairly wide range of y -values. 

Single- versus multi-sample segmentation 

Whether the initial segmentation of a dataset is most 
appropriately done using single- or multi-sample methods 
depends both on the purpose and the data. Using methods 
with common breakpoints for samples will increase the 
power for detecting concordant but quantitatively weak 
segments, while it will reduce the ability of detecting 
(or correctly positioning) discrepant breakpoints. A well 
known example of aberrations with common boundaries 
is germline copy number variants (CNVs), thus some pro- 
posed algorithms for CNV detection utilize segmentation 
with joint segment borders (e.g. [21]). Another impor- 
tant example of samples with (partially) common segment 
boundaries arises when the samples originate from differ- 
ent clones of the same (early) tumor. This is illustrated 
below in two examples, one on disseminated tumor cells 
from breast carcinomas, the other on tumor clones found 
at successive biopsies from lymphoma patients. Recent 
reports [28] on marked variations in aberration patterns 
within the same tumor is likely to increase the number 
of studies using several samples taken from each tumor. 
What is common as well as what differs in the aberration 
patterns will then be of interest, motivating the combined 
use of single- and multi-sample methods. In applications 
searching for genomic copy number hot spots with rel- 
evance to cancer development, it may be important to 
utilize the precise delineation of the aberrations found in 
each sample, and thus the use of single-sample methods is 
most appropriate. The identification of the relevant recur- 
rent aberrations may then utilize post processing tools 
like GISTIC [29], KCsmart [30] or cghMCR [31] (see also 
the review in Rueda [18]). If focus is on clustering sam- 
ples or on constructing regression variables for relating 
more broad aberrations to clinical outcome, one may con- 
sider using multi-sample methods. However, to be useful, 
the estimates from the multi-sample methods should in 
a proper way reflect the main information content in 
each sample. This implies that a multi-sample analysis 
should result in estimates approximating those obtained 
from single-sample analyses. Figure 4 shows heatmaps of 
results from single- and multi-sample PCF for 49 breast 
cancers from the so-called MicMa data set (see [32] and 
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Figure 2 The effect of changing the penalty y in PCF. The plot in the upper left corner shows the copy number data for a selected chromosome 
(in this case, chromosome 1 7), while the lower right plot shows the number of segments found by PCF as a function of y. The remaining plots show 
the segmentation curves for ten different values of y. The plot was created with the function plotGamma in copynumber. 



Additional file 2) analyzed on 244K Agilent arrays. The 
main features appear to be well reflected in the multi- 
sample analysis. On a more detailed level, differences can 
be observed: the multi-sample solution misses some short 
aberrations occurring in only a few samples, aberration 
borders are sometimes slightly shifted, and longer seg- 
ments obtained with single sample PCF are often divided 
into subsegments with slightly different copy number esti- 
mates. The moderate difference between the results of 
single- and multi-sample PCF was also confirmed by a 
comparison of the ability to detect specific aberrations as 
revealed by comparison to MLPA analyses, see Figure 3b. 
This indicates that at least for cancer types where aberra- 
tions are focused in certain areas of the genome, methods 



using joint boundaries might be considered for construct- 
ing variables to be used in further statistical analysis. 

Comparing tracks: Analysis of disseminated tumor cells 

Disseminated tumor cells (DTCs) are detected in the bone 
marrow of some patients with breast carcinomas. The 
presence of DTCs in the bone marrow identifies patients 
with less favorable outcome (see, e.g., [33]), and genomic 
characterization of such cells is of substantial interest. It 
is still an open question to what extent the aberration 
patterns in DTCs correspond to those found in the pri- 
mary tumor; the DTCs may potentially have obtained new 
aberrations or, alternatively, the cells may have originated 
from (early) subclones of the tumor with less aberrations. 
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Figure 3 Aberration calling accuracy. The ROC-curves show the 
sensitivity and specificity for a sequence of thresholds as calculated 
by comparing aberration calls to the classifications made in a 
MLPA-analysis on the same data material. In panel (a), classifications 
were made based on PCF segmentations found for a wide range of 
y-values. Notably, the classification accuracy is not affected much by 
the choice of y, except to some extent for very low values. Panel (b) 
shows that aberration calls based on multi-sample PCF 
segmentations are about as accurate as those based on single sample 
PCF. In panel (c), ROC-curves are shown for calls made on the basis of 
the segmentations found by PCF and CBS, a running median with 
window size 50 and raw data. In terms of aberration calling accuracy, 
PCF and CBS give nearly the same results, while using the running 
median gives slightly less accurate classifications. Using only raw data 
leads to much poorer accuracy. Note the range on the ordinate axis. 



It is possible to analyze single cells using aCGH; how- 
ever, currently the noise level is high, making it difficult 
to draw definitive conclusions from a single cell However, 
since segment boundaries are assumed to be partly com- 
mon, we tested the multi-sample PCF algorithm on breast 
cancer cases from which DTCs were available (cf. [32] 
and GEO accession number GSE27574). Figure 5 shows 
the results on a set of DTCs and the corresponding pri- 
mary tumor from one such patient. Since multi-sample 
PCF is used, segment boundaries are common, while the 
estimated level in each segment is determined by the indi- 
vidual DTC/primary tumor. In Figure 5, two of the single 
cells seem to have a pattern similar to the primary tumor. 
The last one has an essentially flat (balanced) profile and 
is likely to be a hematopoietic cell misclassified as a tumor 
cell (separation of DTCs from other cells is often difficult). 
These data thus indicate that the aberration pattern of the 
DTCs quite closely reflect that of the primary tumor. With 
only two single cells present, Figure 5 primarily shows that 
DTCs inherit the aberrations of the primary tumor; with 
higher numbers of cells, multi-sample PCF may also be 
used to search for aberrations found in DTCs but not in 
the tumor. 



Defining variables: Genetic evolution in follicular 
lymphoma 

Follicular lymphoma is normally a slowly progressing 
malignancy, but relapses are common and the disease 
is usually fatal. In a recent study, 100 biopsies from 44 
patients diagnosed with follicular lymphoma were eval- 
uated using a custom-made aCGH platform consisting 
of 3k BAC/PAC probes [34]. A whole-genome view of 
aberration frequencies (based on single sample PCF) 
and highly correlated aberrations (based on multi-sample 
PCF) are shown in Figure 6. 

Although the delineation of segments varied between 
biopsies, several areas with a high frequency of 
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Figure 4 Comparison of results from single sample and multi-sample PCF. In single sample PCF, y = 40 was used, while in multi-sample PCF, 
y = 120 was used to limit the number of segments. Note that the estimated aberration patterns are quite similar; indicating that the multi-sample 
PCF estimates (panel b) should be well suited as variables in statistical analyses. On a more detailed level there are differences, e.g., longer segments 
in the single sample analysis (panel a) are divided into subsegments with slightly different estimates in the multi-sample analysis. The plot was 
created with the function plotHeatmap in copynumber. 
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Figure 5 Analysis of disseminated tumor cells (DTCs) with multi-sample PCF. The top panel shows the primary tumor and the three panels 
below show single cells morphologically classified as DTCs (all for chromosome 2). High noise levels make separate analyses of each DTC difficult; 
co-analyzing multiple DTCs, possibly together with a primary tumor, thus facilitates an evaluation of the degree of correspondence between the 
aberration patterns. In the present case, two DTCs seem to have aberration patterns similar to the primary tumor, while the last cell has an 
essentially flat (balanced) pattern and is probably a hematopoietic cell misclassified as a DTC. The plot was created with the function plotChrom 
in copynumber. 



aberrations could be detected. To try to identify aber- 
rations with prognostic potential, we therefore found 
a common segmentation for the initial biopsies taken 
from each of the 44 patients using the multi-sample PCF 
algorithm. Removing very low variance segments, 93 
segments remained. The corresponding copy number 
estimates were used as covariates in a multivariate Cox 
proportional hazards regression. This revealed 11 seg- 
ments for which gains were significantly associated with 



a survival disadvantage. A particularly strong associa- 
tion was detected for gains on chromosome X in male 
patients. To study the relation between successive biop- 
sies taken from the same patient, multi-sample PCF was 
applied to each patient individually (see Additional file 2). 
As expected, many aberrations are common, but interest- 
ingly, some aberrations are present in early biopsies and 
not in later ones. This contradicts the hypothesis of linear 
development which states that late tumor clones arise 
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Figure 6 Whole-genome view of aberrations in the follicular 
lymphoma data. The plot is based on all 1 00 biopsies, and 
aberrations were defined as copy number estimates above 0.05 (for 
gains) or below -0.05 (for losses). Aberration frequencies are shown in 
red for gains and green for losses. Correlations between the copy 
number activity at different genomic locations are shown as arcs 
(blue for positive correlations and yellow for negative correlations), 
using a correlation threshold of ±0.68 to determine which 
correlations to display. Aberration frequencies are based on the 
segmentation found with single sample PCF (with y = 16 and 
km in = 3), while correlations are based on the segmentation found 
with multi-sample PCF (with y = 6). The plot was created with the 
function plotCircle in copynumber. 



from earlier ones, and supports the alternative hypothesis 
of parallel evolution in different lymph nodes. 

Allele-specific copy number analysis in breast cancer 

Copy number alterations have been extensively studied in 
breast cancer. To what degree gains and losses are asso- 
ciated only with certain alleles has been less studied. In a 
recent study, genotyping of 112 breast carcinoma samples 
was performed using Illumina 109K SNP arrays, and the 
ASCAT method was used to infer the allele-specific copy 
numbers at each locus [26]. However, to do this we first 
had to segment the data; for this purpose we applied allele- 
specific PCF segmentation to all samples. In Figure 7, the 
result of this segmentation is shown for one particular 
sample and two different chromosomes. In Figure 7a, the 
segmentation of chromosome 1 is shown, and we clearly 
identify three segments on the p-arm with copy numbers 
less than two, larger than two, and identical to two (we 
assume here that tumor ploidy is 2). Suppose we con- 
sider only germline heterozygous loci, in which case the 
allelic ratio is 1/2 when no aberrations are present (one 
copy of B and two copies in total). The BAF track reveals 



allelic imbalance in the first two segments, and more pro- 
nounced in the first segment than in the second. This is 
consistent with a loss of one copy in the first segment 
(i.e. a hemizygous loss, resulting in an allelic ratio of 0/1 
or 1/1 depending on whether the A-allele or the B allele 
is lost), and a gain of one copy in the second segment 
(resulting in an allelic ratio of 1/3 or 2/3 depending on 
which allele is gained). The third segment has an allelic 
ratio of 1/2. Notice that in case of allelic imbalance, the 
observed allele ratio is substantially closer to 0.5 than 
expected by the above theoretical ratios. This attenuation 
of the signal (which also affects the logR values) is due 
to technical issues like cross-hybridization, as well as the 
fact that in reality the tumor is a mixture of cells with 
normal DNA (two copies of each locus) and tumor cells 
with aberrant DNA. In Figure 7b we notice a sharp trough 
in the logR track on 17p, accompanied by an allelic ratio 
close to 0.5. 

If for a certain SNP locus one allele is substantially 
more frequently gained than the other allele, one may 
hypothesize that the former allele is subject to a larger 
selective pressure to change copy number. This, in turn, 
may be an indication of different roles being played by the 
two alleles with respect to cancer progression and evolu- 
tion, suggesting that loci subject to allelic skewness can 
be potential unique markers for breast cancer develop- 
ment. Even from a relatively small number of samples, 
probes with highly significant allelic skewness have been 
identified in a genome-wide statistical evaluation [26]. 

Outliers and Winsorization 

While least squares methods are often favored due to their 
optimality properties, they are also known to be sensitive 
to extreme observations. Thus, except if the purpose is to 
search for short aberrations of biological origins (CNVs), 
we advise the use of an outlier handling procedure. To 
evaluate the proposed Winsorization scheme, we first 
established a suitable way of simulating extreme observa- 
tions. A classical way is to use "contaminated normals", 
where the error distribution is a mixture of two normal 
distributions [35]. With probability 1— a the error is drawn 
from a distribution N(0, cr 2 ), and with probability a from 
N(0, d 2 cr 2 ), typically with d = 3 and a = 0.05. We com- 
pared the fraction of outliers in observed copy number 
data to the corresponding fractions in normals and con- 
taminated normals, using MAD to estimate SDs. For the 
normal distribution, the fraction of observations outside 
3 SD is 0.27% and outside 5 SD < 0.00001%, while these 
fractions for the 5% contaminated normal are 1.64% and 
0.42%. For the Agilent 244K used on the MicMa dataset, 
the fractions were 1.89% and 0.59%, that is, slightly above 
the values for the contaminated normal. For the 44K Agi- 
lent and Illumina 109K applied to the same data, the 
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Figure 7 Allele-specific PCF analysis of SNP array data. Results are shown for a breast carcinoma sample in the MicMa cohort for chromosome 1 
(panel a) and chromosome 1 7 (panel b). The points in the upper two panels show observed total copy numbers (logR) while the points in the lower 
two panels show observed B allele frequencies (BAF). The red curves show the result of applying the allele-specific PCF segmentation method to the 
data. The plot was created with the function plotAllele in copynumber. 



percentages were slightly lower (3 SD: 1.41% and 1.18%; 
for 5 SD 0.29% and 0.11%), however still indicating that 
the 5% contaminated normal is an appropriate distribu- 
tion when evaluating robustness of copy number assess- 
ment procedures. Inspection of data obtained by the 318K 
Illumina, 4 x 180K Agilent and Nimblegen 2.1M arrays 
also confirmed the existence of substantial amounts of 
outliers. The PCF algorithm was tested with and without 
Winsorization on simulated data with outliers (Table 1). 
Outliers in the contaminated distributions may cause the 
detection of short false aberrations; such spikes occurred 
roughly ten times per 1000 probes, as compared to less 
than two times per 1000 for uncontaminated data. Table 1 
further shows that Winsorization efficiently reduces the 
number of falsely detected aberrations and make results 
for the contaminated distribution roughly equal to the 



ones for the normal distribution. In line with these obser- 
vations, outliers tend to change the form of aberrations 
(their height and length), while Winsorization brings the 
distribution fairly close to the one found for normal data 
(data not shown). 

Another way to avoid that a few extreme observations 
result in a segment is to impose a lower limit on the length 
(number of probes) of a segment. With a lower length 
limit of five probes, we found about twice as many false 
spikes as with Winsorization when adjusting y to give 
equal sensitivity for true aberrations. Still, simulations 
indicate that a lower limit on segment length is valuable 
in combination with Winsorization. Note that outliers of 
biological origin will be more extreme if the technology 
has an inherent low noise level, as is the case, e.g., for 
BAC arrays and for high throughput sequencing. Thus, 



Table 1 Outlier effects 

Type Distribution Sensitivity(%) Speciflcity(%) False aberrations (%) 

Normal 79.5 96.5 0.15 

A Normal w/5%contam. 78.8 93.7 1.04 

Normal w/5% contam., Winsor. 78.1 96.0 0.13 

Normal 78.9 93.6 0.20 

B Normal w/5% contam. 77.8 90.6 1.06 

Normal w/5% contam., Winsor. 77.5 93.3 0.15 

Shown is the effect of Winsorization on simulated data with outliers and artificial (low-amplitude) aberrations. Two types of aberrations are considered: (A) aberrations 
of height 1 .5 and length 1 0 probes and (B) aberrations of height 1 .0 and length 30. The contamination consists of normals with SD=3 and the MAD estimate of SD 
equals 1 .0. Sensitivity is the percentage of amplified probes that are detected as amplified, while specificity is the percentage of non-amplified probes classified as 
such. The false aberration column gives the percentage of aberrations not covering the central part of the real amplifications. 
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outliers are not a sign of inappropriate functioning of a 
technique, but a characteristic of the data requiring con- 
sideration in the analysis. In summary, copy number data 
tend to contain a high fraction of outliers. These out- 
liers often induce false aberrations, but simple procedures 
like Winsorization will efficiently reduce these undesired 
effects. 

Computational performance 

In R, using the vector based PCF implementation 
described in Algorithm 1 implies a substantial efficiency 
gain over loop based implementation, roughly a 10-20 
times reduction in time requirements. The fast imple- 
mentation of PCF (Algorithm 3) gives a further marked 
reduction in computing time. On the MicMa 244k dataset 
(longest arm ^10000 probes), the implemented fast ver- 
sion is about 15 times faster than the exact one, and uses 
around 3.5 minutes to process the 49 samples (4 seconds 
per sample, see Table 2). The multi-sample method was 
slightly faster than the single sample version. 

The deviations between the solutions found by the exact 
PCF and fast PCF on the MicMa set were small; in terms 
of reduction in variance (difference between sample vari- 
ance and residual variance after fitting PCF curves) below 
0.01%. The differences observed for the curves were typ- 
ically small shifts in the border of aberrations. Thus, we 
conclude that the results from the fast procedure for 
practical purposes may be regarded as global solutions 
to (3), and the fast version is therefore used by default 
in copynumber. We also compared the performance 
of PCF with two other segmentation methods: Circu- 
lar Binary Segmentation (CBS) [4,36] and Fused Lasso 
Regression (FL) [12]. In comparison studies [5,8], CBS 
has shown good performance in terms of sensitivity and 
false discovery rate. It is probably the most commonly 
used freely available algorithm and is also implemented 
in several commercial analysis tools. CBS is available in 
the R package DNAcopy, which is used for this compar- 
ison. FL is a more recent proposal implemented in the 
R package cghFLasso, and is one of three preferred 
methods in the web-based segmentation tool CGHweb 
[37]. Using default parameter settings, we compared the 



computing times of PCF, CBS and FL on the 49 sam- 
ples in the 244 K MicMa data set, and on 6 samples 
from a 1.1 M Illumina SNP array (using the logR values). 
Table 2 gives the average computation time (in seconds) 
per sample. With no preprocessing of the data, PCF is 
on average 3-4 times faster than CBS on both data sets, 
and about 4 times faster than FL on the largest data set. 
Note that copynumber detects and operates on chromo- 
some arms, while DNAcopy operates on whole chromo- 
somes. This partly explains the difference in performance 
between PCF and CBS for the MicMa data set; for the 
Illumina data this has little impact due to the iterative 
approach used in PCF for the longest sequences. PCF was 
also markedly faster in evaluations based on simulated 
data; however, comparisons are complicated by the fact 
that the speed of CBS depends on the data in a nontriv- 
ial manner. As seen from the IQRs listed in parentheses 
in Table 2, the speed of CBS is quite variable from sample 
to sample while PCF and FL is nearly constant. Moreover, 
the table shows that CBS runs 2-3 times slower when out- 
liers have been removed using Winsorization, underlining 
that the performance of CBS is highly data dependent. 
We underline that the above-mentioned results only relate 
to the current R implementations. As mentioned in the 
introduction, PCF is conceptually similar to the CGHseg 
method described by Picard et al.[6], and we also exam- 
ined the computational performance of this method using 
the implementation in the R package cghseg. Using the 
version of CGHseg that requires a prespecified number 
of segments for each chromosome, the algorithm is fast, 
although the speed depends on the number of segments. 
Using the full CGHseg algorithm that automatically deter- 
mines the number of segments, the algorithm is very slow 
for high-resolution data. Hence, making a fair comparison 
between PCF and CGHseg is difficult. 

Segmentation accuracy 

We further compare the accuracy of the segmentation 
solutions found by PCF and CBS. Figure 3c shows ROC 
curves using MLPA classifications as the truth, and then 
applying a range of aberration calling thresholds to PCF 
estimates, CBS estimates, a running median with window 



Table 2 Computational performance 

Method R package Agilent 244K Illumina 1.1 M 

Raw data Outliers removed Raw data Outliers removed 

PCF copynumber 4(0.2) 4(0.2) 23(0.7) 22(0.4) 

Fused Lasso cghFLasso 5 (0.2) 5 (0.2) 97 (0.7) 99 (3.3) 

CBS DNAcopy 15(4.7) 35(4.1) 71(12.9) 219(12.8) 

The average computation time (in seconds) per sample is shown for copynumber (PCF), DNAcopy v1 .30.0 (CBS) and cghFLasso vO.2-1 (Fused Lasso) on the MicMa 
244 K data set (49 samples) and on the logR values from an Illumina 1 .1 M SNP array data set (6 samples). The IQR over samples is given in parenthesis. The methods 
were applied to both raw data and data with outliers removed using the Winsorization method. All tests were performed on a PC with a 2.93GHz Intel i7 CPU with 8 Gb 
of memory running Windows 7 and R 2.1 5.1 (64-bit). 
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size 50 and raw copy number data (details in Additional 
file 2). Results for PCF and CBS are similar, both achieving 
high sensitivity and specificity. The running median also 
gives good results, illustrating that many probes are fairly 
easy to classify and that the gain obtained by using meth- 
ods like CBS and PCF is mainly an improved classification 
close to borders between segments. We also repeated the 
simulation study in [8] where CBS was found to be the 
most sensitive method while also having the lowest false 
discovery rate. Again, we found that PCF and CBS had 
very similar performance (results not shown). A more 
detailed comparison of segmentation results shows that 
overall results are quite similar for single sample PCF and 
CBS, however for both methods the results depend on the 
choice of parameter values and the handling of extreme 
observations, see Additional file 3. In conclusion, PCF and 
CBS typically provide similar results and have equivalent 
accuracy when parameters are tuned appropriately. 

Conclusions 

Copy number segmentation based on least squares prin- 
ciples and combined with a suitable penalization scheme 
is appealing, since the solution will be optimal in a least 
squares sense for a given number of breakpoints. We 
have proposed a suite of platform independent algorithms 
based on this principle for independent as well as joint 
segmentation of copy number data. The algorithms per- 
form similarly as other leading segmentation methods 
in terms of sensitivity and specificity. Furthermore, the 
proposed algorithms are easy to generalize and are com- 
putationally very efficient also on high-resolution data. 
The Bioconductor package copynumber offers a user- 
friendly interface to the proposed algorithms. 

Several extensions and modifications of the proposed 
least-squares framework are possible. In principle, the 
L2-based distance measure used in the current implemen- 
tation of PCF is easily extended to general Lp-distances. 
However the current implementation is highly optimized 
for L2, and other distance measures would require sub- 
stantial heuristics to obtain comparable computational 
performance. Another extension is to introduce locus spe- 
cific penalties for breakpoints, thus essentially introducing 
a prior on the location of breakpoints. Work in progress 
includes specialized routines to handle high throughput 
sequencing data more efficiently and joint analysis of 
multiple samples in allele-specific PCF. 

Availability and requirements 

Project name: Copynumber 

Project home page: http://heim.ifi.uio.no/bioinf/ 

Projects/Copynumber/ 

Operating system(s): All systems supporting the R 
environment 

Programming language: R 



Other requirements: No 
License: GNU Artistic License 2.0. 



Additional files 



Additional file 1 : This pdf-file contains a formal description of the 
iterative PCF-based Winsorization algorithm. 

Additional file 2: This pdf-file contains a description of the three data 
sets used in this paper. 

Additional file 3: This pdf-file describes a comparison of 
segmentations performed by CBS and PCF on a MicMa sample. 
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